**1. Redesign the post-process logic**

In the original design, the DLA performs the post-processing (bias, batch normalization, element-wise addition and multiplication) as soon as the convolution or FC results are produced. The PE array (for convolution and FC) and the post-process elements are pipelined for real-time ASR. However, this design requires substantial hardware resources.

In WaveNet, the number of multiplications in a convolution layer is given by Eq. 1:

$$\#\mathrm{mult\_conv} = W_{out} \times K \times C_{in} \times C_{out} \tag{1}$$

where $W_{out}$ is the length of the output time sequence, $K$ is the kernel size, and $C_{in}$ and $C_{out}$ are the numbers of input and output channels, respectively.

By contrast, the number of multiplications in BN, an element-wise operation, or an activation function such as sigmoid or tanh is given by Eq. 2:

$$\#\mathrm{mult\_post} = W_{out} \times C_{out} \tag{2}$$

Hence, the post-processing accounts for only a small fraction of the multiplications in the whole layer: dividing Eq. 2 by Eq. 1 gives a ratio of $1 / (K \times C_{in})$.

A similar result holds for FC/LSTM layers.
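To put a number on this, here is a minimal Python sketch that evaluates Eq. 1 and Eq. 2 for one layer. The function names and the layer sizes (100 output steps, kernel size 2, 64 input and 64 output channels) are illustrative assumptions, not values taken from the actual DLA configuration.

```python
def conv_mults(w_out, k, c_in, c_out):
    """Multiplications in one convolution layer (Eq. 1)."""
    return w_out * k * c_in * c_out


def post_mults(w_out, c_out):
    """Multiplications in one post-process step such as BN (Eq. 2)."""
    return w_out * c_out


# Hypothetical layer sizes, chosen only for illustration.
W_OUT, K, C_IN, C_OUT = 100, 2, 64, 64

n_conv = conv_mults(W_OUT, K, C_IN, C_OUT)
n_post = post_mults(W_OUT, C_OUT)
ratio = n_post / n_conv  # equals 1 / (K * C_IN)

print(f"conv: {n_conv}, post: {n_post}, post/conv: {ratio:.4%}")
```

With these assumed sizes, one post-process step needs 1/128 of the multiplications of the convolution, so a post-process unit sized to keep pace with the PE array would sit idle most of the time.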

Therefore, I make a compromise between the one-pass pipelined logic and the one-after-another operation flow in reference [1]. The bias or BN is still applied immediately when the convolution or FC result is produced. The element-wise operations, however, require the features to be read back from the Feature Buffer. In this way, 32 x 2 floating-point multipliers and adders are saved, and the post-process logic is kept fully loaded while the convolution or FC is being computed.
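The two-phase flow described above can be modelled in software roughly as follows. This is only a behavioural sketch for a single output channel with an element-wise addition. The function names, the list used as a stand-in for the Feature Buffer, and the input values are all hypothetical and do not reflect the actual RTL.

```python
def phase1_conv_with_bias(x, weights, bias):
    """Phase 1: 1-D convolution with the bias applied as each output is produced.

    Returns a list that stands in for the Feature Buffer.
    """
    k = len(weights)
    feature_buffer = []
    for t in range(len(x) - k + 1):
        acc = sum(x[t + i] * weights[i] for i in range(k))  # PE array work
        feature_buffer.append(acc + bias)  # bias applied at once, no extra pass
    return feature_buffer


def phase2_elementwise_add(feature_buffer, other):
    """Phase 2: read the features back and apply the element-wise operation."""
    return [f + o for f, o in zip(feature_buffer, other)]


# Hypothetical input, kernel, bias and second operand.
x = [1.0, 2.0, 3.0, 4.0]
buf = phase1_conv_with_bias(x, weights=[1.0, -1.0], bias=0.5)
out = phase2_elementwise_add(buf, other=[1.0, 1.0, 1.0])
print(buf, out)
```

The second pass costs an extra read of the Feature Buffer, which is the price paid for not duplicating the multipliers and adders in a dedicated element-wise stage.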

Reference:

[1] Laika: A 5uW programmable LSTM accelerator for always-on keyword spotting in 65nm CMOS.

**2. Port the DLA SoC slave interface to the AXI bus and the DLA DDR master interface to the PHI bus for DDR**

This work is ongoing. My current laptop cannot place and route (P&R) the relevant Xilinx IPs because of its limited CPU and DDR memory capacity, so in the meantime I am reviewing the previous work in Tanji3. I will get a more powerful computer on Friday and hope it can run the Vivado P&R.

**3. If anyone can enter the lab, please reboot my computer into UNIX and start TeamViewer.**